issue is that multiple candidates for functional replacement are created for each 
basic block of code, which may overburden the cache and the function 
compilation process. 

In Figure 10, the freeze-cache feature of the FLU is used by the CCC 
operational flow in order to separate the two functions of advice (notifying the 
FLU of a candidate for functional replacement) and consent (allowing the FLU to 
execute a replacement function, skipping over the instructions immediately at 
hand). In this scheme, the CCC yields to the FLU with the lookup cache un-frozen 
on every transfer of control, and after every successful replacement function 
execution. This allows all basic blocks that are not yet replaced to be considered 
as candidates, and just once, on entry to the block. After that, the FLU is 
consulted on successive instructions with the lookup cache frozen, so that any 
pre-existing functions will be found, but no new candidates will be created. In 
this way, the FLU is consulted for a replacement on every instruction, but the 
function compilation process is notified just once per candidate basic block. This 
process is responsible for creating functions for register-only code and attaching them to 
the proper entry addresses, as before. This scheme still has the disadvantage that 
the FLU is consulted for a replacement function on every instruction execution. It 
is also more complex than those in figures 4 and 9. 
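
The separation of advice and consent described above can be illustrated with a minimal sketch. It is not taken from Figure 10 itself; the instruction encoding, the FLU class, its consult method, and the run loop are all hypothetical names introduced only for illustration, and a replacement function is modeled simply as a callable that returns the address at which the CCC resumes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical instruction encoding (an assumption for this sketch):
# ("op", None) falls through to pc + 1, ("jump", target) transfers control.
Instr = Tuple[str, Optional[int]]


@dataclass
class FLU:
    # Entry PC -> replacement function; the function returns the PC to resume at.
    functions: Dict[int, Callable[[int], int]] = field(default_factory=dict)
    # Entry PCs reported to the function compilation process (advice).
    notified: List[int] = field(default_factory=list)
    consultations: int = 0

    def consult(self, pc: int, frozen: bool) -> Optional[Callable[[int], int]]:
        """Look up pc; record a new candidate only while the cache is un-frozen."""
        self.consultations += 1
        fn = self.functions.get(pc)
        if fn is None and not frozen:
            self.notified.append(pc)
        return fn


def run(program: List[Instr], flu: FLU, start: int = 0, max_steps: int = 1000) -> None:
    pc = start
    frozen = False  # program entry is treated like a transfer of control
    for _ in range(max_steps):
        if not 0 <= pc < len(program):
            break
        fn = flu.consult(pc, frozen)
        if fn is not None:
            pc = fn(pc)       # consent: the FLU executes the replacement
            frozen = False    # un-freeze after a successful replacement
            continue
        kind, target = program[pc]
        if kind == "jump":
            pc = target       # transfer of control: next consult is un-frozen
            frozen = False
        else:
            pc += 1           # straight-line code: next consult is frozen
            frozen = True
```

In this sketch the FLU is consulted once per instruction, but an address is appended to notified only on the first instruction after a transfer of control or after a replacement function has run, which mirrors the once-per-block-entry notification described above.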

The CCC/FLU interaction schemes of figures 4, 9, and 10, as well as many 
others, can all be implemented with the same elements, structures, and signals of 
the proposed invention. 

REMARKS 

Applicant received an Office Action dated 5/13/03 from Examiner Scott 
Collins for Ser. No. 09/477,047, filed 12/31/99 (and Request for Continued 
Examination filed March 25, 2003). While working on a response Applicant 
realized the need to add the above material to the patent application. Therefore 
Applicant is filing the present CIP and wishes to additionally address the prior art 
which Examiner Collins brought up in his office action of 5/13/03. The 
following remarks address the prior art and the reason for adding material to the 
patent application. 

Claims 1-9 are still in this case. Applicant has changed FIG. 5 to be 
consistent with the text of the application. In addition, FIGs. 9 and 10 have been 
added and a continuation-in-part application has been filed. The Examiner has 
withdrawn his objections to the ABSTRACT and to the drawings. The Examiner 
has introduced another reference, Yard (5,892,934), on which he is relying. 
Applicant will address the differences between Yard and applicant's invention first. 

DIFFERENCE BETWEEN YARD AND APPLICANT'S INVENTION 
In the Yard reference, microprocessor 12 needs to fetch and decode (i.e. 
from memory cache 42) a subroutine call instruction having a target address of a 
DSP function and then transmit the target address to the DSP 14. The DSP then 
executes the routine stored in the target address. At the conclusion of the 
subroutine code sequence, a corresponding subroutine return instruction is 
fetched from cache 42 and executed in microprocessor 12. The subroutine return 
instruction uses the sequential address stored by the most recently executed call 
instruction as a target address (col. 3, line 25 - col. 4, line 4). Every DSP function 
performed by DSP 14 must be planned and programmed into the instruction 
memory cache 42 by the programmer. This includes both the subroutine call and 
subroutine return instruction fetches. In addition, the DSP address table must be 
loaded with explicit values (subroutine addresses) after the program has been 
compiled and linked. If the program in microprocessor 12 is changed, or if the 
DSP mechanism is to be used with other new general purpose programs, the DSP 
address table must be reloaded with different values. Manual programmer 
intervention to update the DSP tables is necessary for such new general programs. 
Yard provides no disclosure of how to update such DSP programs automatically. 
One disadvantage of the Yard approach is that an instruction fetch is needed to 
execute a DSP function so it is subject to the "Von Neumann bottleneck" 
problem. A second disadvantage is that every DSP function must be programmed 
into the program in microprocessor 12 and the DSP table updated for every new 
application program used. This means recompiling every time a new application 
program is used. 

In contrast, applicant's invention includes a traditional CPU (i.e. CCC 12 
in FIG. 1) and a logic function cache ("FLU") (14 in FIG. 1) containing a table of 
specialized accelerating functions each of which is identified by an address. The 
functions are generated by preloading and synthesized on-the-fly by an exception 
routine (page 3, lines 24-28 and page 5, lines 11-20) to be subsequently described. 
An address from program counter (PC) 16 is presented to both the conventional 
CCC 12 and the FLU 14 so that one of them can execute the required function. If 
there is not a match for the PC 16 address in the FLU table, the FLU and CCC will 
coordinate so that the CCC can perform the function. If there is a match in the 
FLU table, the FLU performs the function associated with the address. At the 
same time the PC 16 address is presented to the CCC and FLU, a counter in the 
FLU keeps count of the number of times the PC 16 address has been presented. If 
it has been presented a number of times the FLU recognizes that this function is a 
"hot" function for which there should be a specialized accelerated function in the 
FLU. The exception routine will generate the function and put it in the FLU table 
for future use when the PC 16 address comes up again. The above sequence will 
be repeated as each PC 16 address is presented to the CCC and FLU. The above 
is an overview description of applicant's invention. The Examiner is referred to 
applicant's previous response dated March 25, 2003 (pages 4-6) for a detailed 
schematic description. 
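
The overview above can be summarized in a short sketch. It is an illustration under stated assumptions, not the disclosed circuit: the class name FunctionLookupUnit, the hot_threshold parameter, and the synthesize callback (standing in for the exception routine) are hypothetical, and the application does not state a specific count at which an address becomes "hot".

```python
from typing import Callable, Dict, Optional

Function = Callable[[], None]


class FunctionLookupUnit:
    """Toy model of the FLU table: functions are keyed by PC address only."""

    def __init__(self, hot_threshold: int, synthesize: Callable[[int], Function]):
        self.table: Dict[int, Function] = {}   # PC address -> accelerating function
        self.counts: Dict[int, int] = {}       # PC address -> times presented
        self.hot_threshold = hot_threshold     # assumed value, not from the source
        self.synthesize = synthesize           # stands in for the exception routine

    def present(self, pc: int) -> Optional[Function]:
        """Present a PC address; return a function on a match, else None."""
        self.counts[pc] = self.counts.get(pc, 0) + 1
        fn = self.table.get(pc)
        if fn is not None:
            return fn                          # match: the FLU performs the function
        if self.counts[pc] >= self.hot_threshold:
            # "Hot" address: generate a function for future presentations.
            self.table[pc] = self.synthesize(pc)
        return None                            # no match: the CCC performs the function
```

Note that the lookup takes only the PC address as input; no opcode or instruction word is passed to the FLU at any point in this sketch.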

Applicant's invention differs from Yard in that it does not need to fetch 
instructions from instruction memory to perform a DSP function as does Yard. In 
addition, it does not need a programmer to program in every DSP function in 
microprocessor 12 and update the DSP table every time a new application 
program is used. However, the real difference between applicant's invention and 
Yard is that applicant's invention is independent of anything in the program. The 
FLU has its own table of functions and watches the application program as it 
progresses by watching the PC 16 address. If it sees a PC 16 address that matches 
a function identifier in its function table then it performs the function. But just as 
importantly a counter keeps track of how many times a PC 16 address has been 
presented. If the PC 16 address appears frequently, the FLU considers it a "hot" 
address and requests that an exception routine generate an appropriate function 
for that address and put it in the FLU table so it may be used the next time the PC 
address comes up. All of this is done with no interaction with the program or 
with the CCC. The only interaction with the CCC is when the CCC and FLU 
coordinate to see which performs the function required by the PC 16 address. In 
this way, no matter how the application changes, or WHATEVER application is 
run, there is an acceleration mechanism which can detect hot program regions, 
translate them to logic, and then insert these logical functions (by loading the 
right PC values into the FLU cache, NOT by changing the program in memory or 
in the INSTRUCTION cache) into the program flow, all unbeknownst to the 
authors or maintainers of the original programs, and without requiring any 
specific knowledge of ANY program, only the semantics of sequences of 
conventional CPU instructions. 

REJECTION UNDER 35 USC 103 UNDER TRIMBERGER (5,752,035) 
IN VIEW OF YARD (5,892,934) 

Applicant notes the Examiner in paragraph 7 has cited the equivalency 
between a number of elements of Trimberger and those in Applicant's claim 1. 
Applicant would like to remind the Examiner of Applicant's comments in his 
office action response of March 25, 2003, which state as follows: 

"In paragraph 10 the Examiner picks and chooses from among the 
various elements of Trimberger and collects them together to show 
Applicant's claim 1. However, he leaves out other elements which 
are necessary to make the Trimberger invention work such as; the 
decoder (314 in FIG. 6), OPCODES (FIG. 6), and connection of the 
RISA processor to instruction memory or instruction cache (col. 13, 
lines 21-25). The collection of elements that the Examiner cites in 
paragraph 10 to anticipate Applicant's claim 1 does not function 
since it is missing the decoder, OPCODES, and connection from the 
RISA back to the instruction memory. These elements need to be in 
the Trimberger invention to make it work. Looking at FIG. 6 there 
would be a gaping hole in the schematic where elements would no 
longer be connected. Also because those elements are needed in the 
invention this means the elements of Trimberger are schematically 
ordered differently that the Applicant's invention and the 
connections and signals between the elements of Trimberger are 
different than for applicant's invention. The elements of 
Trimberger that the Examiner cites to be the same as applicant's 
invention have different connections and different input and output 
signals. Hence they can't be claimed to be equivalent elements to 
Applicant's." 

Applicant fails to understand how the Examiner can continue to insist on 
the equivalency of the elements in Trimberger with the elements in Applicant's 
invention in claim 1 when the schematic diagrams for the two circuits are entirely 
different and the input and output signals of the elements of the Trimberger 
schematic are different than those of Applicant's schematic. Applicant clearly 
pointed out in his March 25, 2003 response how Claim 1 mirrors the schematic of 
his invention. Claim 1 recites an invention having a different schematic and 
different inputs and outputs for most of the elements than Trimberger. Applicant 
would also like to remind the Examiner that in Paragraph 18 of the Examiner's 
office action of 5/13/03 he agreed that applicant's arguments for claim 1 were 
persuasive. 

The differences between the combination of Yard and Trimberger and 
Applicant's invention are as follows. 
1. First Difference 

VON NEUMANN BOTTLENECK PROBLEM OF YARD AND 
TRIMBERGER 
Both Trimberger and Yard fetch an instruction from instruction 
memory that tells them what function to perform in their respective 
accelerating processors. In the Yard reference, microprocessor 12 needs to 
fetch and decode (i.e. from memory cache 42) a subroutine call instruction 
having a target address of a DSP function and then transmit the target 
address to the DSP 14. The DSP then executes the routine stored in the 
target address. At the conclusion of the subroutine code sequence, a 
corresponding subroutine return instruction is fetched from cache 42 and 
executed in microprocessor 12 (Col. 3, lines 25-40). Applicant emphasizes 
here that no DSP function is performed until a subroutine call instruction is 
fetched from the instruction cache 42 first. Similarly, in Trimberger both 
the conventional processor (Fixed ALU 301) and the accelerating 
processor (RISA 302,303) use an opcode to identify a function to be 
performed. Both processors need to fetch instructions containing an 
opcode from the instruction memory (cache 311). In addition, Yard needs 
the subroutine return instruction. The need to fetch instructions before 
executing them is called the "Von Neumann bottleneck". The details are 
extensively spelled out in applicant's March 25, 2003 response and need not be 
repeated here since the Examiner agrees with them. 

In contrast, in applicant's invention the function lookup table (i.e. in 
FLU 14) simply looks for a match between the functional lookup table and 
the program counter and then performs a function associated with that 
program counter match. It does this each time the program counter 
advances. Applicant has no need to fetch any instruction for the functional 
lookup table and can't because there is no communication or connection 
with the instruction memory. Because there is no need to fetch an 
instruction from the instruction memory to obtain an opcode, the 
"Von Neumann bottleneck" problem is mitigated. The fact that both Yard 
and Trimberger need to fetch an instruction to indicate what function is to 
be performed teaches away from applicant's invention which has no need 
to fetch such an instruction. When the Examiner looks at the use of 
matching addresses in the DSP table he must look at Yard as a whole. 
Yard first fetches the subroutine call instruction from the instruction cache 
and then finds matching addresses in the DSP table. This teaches away 
from Applicant's invention where there is no need to fetch an instruction 
to identify a function to perform. 

Applicant also takes issue with the Examiner's contention that: 

"it would have been obvious to a person skilled in the art to find a 
match between the tag field and an address. One of ordinary skill in the art 
would have been motivated to do this in order to relieve the processor from 
the burden of fetching and decoding instructions before detecting a match 
and thus by relieving this burden, the processor is able to execute more 
efficiently." Yard teaches away from this conclusion by the Examiner 
since Yard teaches that first the subroutine call instruction is fetched from 
the instruction cache and then a matching address is sought in the DSP 
table. Yard teaches one skilled in the art to fetch an instruction from 
instruction memory to identify the function to be performed and nothing 
about relieving the burden of fetching instructions. There is not one word 
in either the Trimberger or Yard reference about addressing the issue of 
the fetching of instructions and its impact on "Von Neumann Bottleneck" 
problems. In addition, neither Trimberger nor Yard addresses the issue of 
eliminating the need to reprogram the conventional processor to change 
opcodes (and other information) each time a new application program is 
used. If these issues were as easy to recognize as the Examiner contends, 
Trimberger and Yard would both have recognized them. There are a huge 
number of issues to be considered in designing a processor and just as 
many design options and tradeoffs to address those issues. How would a 
designer know to pick the issue of the Von Neumann bottleneck out of the 
huge number of design issues when Yard teaches away? The designer 
would not know unless he were aware of the inventor's teachings. The 
Examiner is using hindsight after seeing Applicant's teaching to say that 
the solution is obvious. 

2. Second Difference 

YARD AND TRIMBERGER NEED TO BE RECOMPILED EACH 
TIME A NEW APPLICATION IS USED 
The difference is as follows. The Trimberger patent has a standard 
processor to run application programs and an accelerating processor called 
a programmable execution unit (Fig. 1, elements 21 and 30). The 
Trimberger patent is different from applicant's invention in that an 
instruction opcode is used to select the function to be performed by the 
programmable execution unit (Fig 1, elements 21 and 30). The importance 
of this is that before each new application program may be used it must be 
recompiled with NEW CPU INSTRUCTIONS and the correct opcodes 
updated in the instructions in the instruction memory (or cache). Similarly, 
in the Yard reference, microprocessor 12 needs to fetch and decode (i.e. 
from memory cache 42) a subroutine call instruction having a target 
address of a DSP function and then transmit the target address to the DSP 
14. The DSP then executes the routine stored in the target address (Col. 3, 
lines 25-40). At the conclusion of the subroutine code sequence, a 
corresponding subroutine return instruction is fetched from cache 42. In 
Yard, opcodes must be updated in both the subroutine call and return 
instructions each time a new program is used. This requires programmer 
intervention, and recompiling with new CPU instructions having the 
correct opcodes. In addition, the DSP address table must be loaded with 
explicit values (subroutine addresses) after the program has been compiled 
and linked. If the program is to be changed, or if the DSP mechanism is to 
be used with other new general purpose programs, the DSP address table 
must be reloaded with different values. Manual programmer intervention 
to update the DSP tables is necessary for such new general programs. 
Yard provides no disclosure of how to update such DSP programs 
automatically. In both cases, specific BINARY encodings of 
INSTRUCTION opcodes (in Yard, general subroutine opcode plus 
specific routine address) are associated and coordinated ahead of program 
run-time with specific logic functions. There must be synchronized, 
programmer-aware cooperation between the process of compiling and 
linking (i.e. building, creating) the program binary and the initialization and 
loading of the device containing the logic functions. 

In contrast, applicant's invention provides an automatic mechanism 
for detecting, compiling, and substituting logic functions (i.e. FLU 14) for 
arbitrary sequences of ordinary CPU instructions, WITHOUT programmer 
intervention, WITHOUT modification or coordination with or knowledge 
of the specifics of any program binary. Applicant's invention has a 
conventional processor to run application programs and an associated 
functional lookup unit that acts as the special accelerating processor. The 
function lookup table simply looks for a match between the functional 
lookup table and the program counter and then performs a function 
associated with that program counter match. Those functions are all in a 
function cache or are created by an exception routine which generates the 
functions on its own and has no connection or communication with the 
conventional processor. Because there are no opcodes to be updated and 
all new functions are created by an exception routine which does not 
communicate with the conventional processor, there is no need to 
recompile the program in the conventional processor each time a new 
application is used. 

Applicant argued this in his response to the Examiner of December 
18, 2002 and reaffirmed the argument in the response of March 23, 2003 at 
page 5, lines 16-23 which states as follows. 

"In this invention, no new instructions are created in the instruction 
set of the original processor, and function indicators for the 
reconfigurable logic array NEVER appear in instruction memory, 
instruction cache, or anything that the ordinary CPU sees in its 
fixed instruction execution. Function selects for the reconfigurable 
array only appear in the logic function cache, and these invocations 
are triggered simply by a match of the PC address as last updated by 
the ordinary CPU with the tag in the cache entry (40 in FIG. 2)." 
In addition, applicant Mr. Luther Johnson is providing an affidavit 
as an expert stating the exact details of the above analysis of the 
differences of Trimberger, Yard and applicant's invention. This is in 
compliance with the request of Examiner Collins and his Supervisory 
Patent Examiner Mr. Eddie Chan during telephone conversations 
applicant's attorney had with Examiner Collins between March 4-24, 2003 
that Mr. Luther Johnson's comments would not have any weight unless 
they were introduced in the form of an Affidavit under 37 C.F.R. 1.132. 

ISSUE FROM APPLICANT'S MARCH 25, 2003 RESPONSE 
RELATED TO THE VON NEUMANN BOTTLENECK PROBLEM 

Applicant wishes to address an issue in the arguments made in his 
response filed March 25, 2003. The reason is as follows. Applicant in his 
original application discloses a broad invention (in FIGS. 1 and 2) having a 
conventional processor (CCC 12) and an FLU 14 which operates independently of 
the CCC and does not need to fetch instructions (see page 12, lines 9-13, and 
page 13, lines 9-13). This broad invention (i.e. claim 1) claims a CCC broadly and 
allows for a variety of CCC's that meet the criteria of the invention. In FIG. 4, 
applicant discloses a specific embodiment of a CCC which he believed was the 
best embodiment when the application was filed. 

Since filing his application, Applicant has found that the specific 
embodiment of the flow chart of FIG. 4 is a suboptimal design. In looking at 
the flow chart, line 8 requires the value of M to be determined (M=0 means the 
next instruction is not a memory accessing instruction; M=1 means it is). If M=0, 
this means the FLU could perform the function. Applicant believes that it is 
necessary to fetch an instruction from the instruction memory to obtain 
information so as to determine M. The purpose for fetching this instruction 
information is not the same as it is in Trimberger and Yard where opcodes are 
sought to identify each new function to be performed by the accelerating 
processor. Nevertheless, in the specific embodiment of FIG. 4 the advantages of 
applicant's invention are not fully realized with respect to the "Von Neumann 
Bottleneck" problem. The embodiment of FIG. 4 still has the advantage that the 
FLU makes no demands on and operates independently of the architecture of the 
conventional processor CCC. Also, the FLU is independent of any instruction 
set. Therefore, recompilation is not necessary each time a new application is 
used. Applicant has added FIGs. 9 and 10 which show alternative embodiments 
of CCC 12. Fig. 9 is the general case where the FLU is consulted before every 
instruction execution, eliminating the need to fetch any instruction at all if a 
match is found. In Fig. 10 the FLU counts occurrences of basic block executions 
(and creates new cache entries) only on the first instruction after a transfer of 
control. These embodiments do not need to compute M and hence do not need to 
fetch an instruction to compute M. They do not have the "Von Neumann 
bottleneck" problem. 
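
The difference in instruction fetching between the FIG. 4 embodiment and the FIG. 9 embodiment can be illustrated with a simplified sketch. It does not reproduce either flow chart. The assumptions are: a program is a list of instruction kinds ("mem" for a memory accessing instruction, anything else for register-only code), the FLU table maps an entry address to the address at which execution resumes after the replacement function, and the names run_fig4_style and run_fig9_style are hypothetical.

```python
from typing import Dict, List


def run_fig4_style(program: List[str], flu_table: Dict[int, int]) -> int:
    """Fetch first to determine M, then decide. Returns the number of fetches."""
    pc, fetches = 0, 0
    while pc < len(program):
        instr = program[pc]
        fetches += 1                      # fetch is needed to determine M
        m = 1 if instr == "mem" else 0    # M=1: memory accessing instruction
        if m == 0 and pc in flu_table:
            pc = flu_table[pc]            # FLU performs the function
        else:
            pc += 1                       # CCC executes the fetched instruction
    return fetches


def run_fig9_style(program: List[str], flu_table: Dict[int, int]) -> int:
    """Consult the FLU before every instruction; fetch only when there is no match."""
    pc, fetches = 0, 0
    while pc < len(program):
        if pc in flu_table:
            pc = flu_table[pc]            # match: no instruction fetch at all
            continue
        fetches += 1                      # no match: CCC fetches and executes
        pc += 1
    return fetches


program = ["alu", "alu", "alu", "mem", "alu", "alu"]
flu_table = {0: 3, 4: 6}                  # entry address -> resume address
print(run_fig4_style(program, flu_table), run_fig9_style(program, flu_table))
```

With this example program and table, the fetch-first policy still fetches at every address where the FLU ends up performing the function, while the consult-first policy fetches only the one instruction that has no match.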

Applicant has amended FIG. 5 (changes shown in red) so that the output 
arrow from the "EXECUTE VIA RCA" now leads back to the arrow between the 
"YES" choice of the output "GO?" diamond, and the entry of the "match" 
diamond. Applicant's invention simply checks with the FLU every time between 
the instruction executions. This guarantees that the invention will move from 
logic function to logic function within the FLU without going back to the CCC 
unnecessarily and therefore, without any instruction fetching. There is no 
instruction fetching by the FLU, and therefore, no Von Neumann bottleneck. The 
change in the drawing is necessary so that it is consistent with the specification at 
page 7, lines 2-14 which specifies that the FLU can execute function after 
function before transferring control back to the CCC. 
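
The effect of the amended arrow in FIG. 5 can be sketched as a loop that returns to the match check after each function executes. This is an illustration only; the name flu_run and the modeling of each logic function as a callable that returns the next PC address are assumptions made for the sketch.

```python
from typing import Callable, Dict, List, Tuple


def flu_run(pc: int, table: Dict[int, Callable[[], int]]) -> Tuple[int, List[int]]:
    """Execute matching functions back to back.

    Returns the PC handed back to the CCC and the entry addresses executed.
    """
    executed: List[int] = []
    while pc in table:            # the "match" check
        executed.append(pc)
        pc = table[pc]()          # execute the function; it yields the next PC
    return pc, executed           # no match: control goes back to the CCC


table = {0x10: lambda: 0x20, 0x20: lambda: 0x30}
next_pc, executed = flu_run(0x10, table)
```

In this example two functions run in succession and control returns to the CCC only at address 0x30, where the table has no match.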

With respect to paragraphs 15-17 of the Examiner's office action, both 
claims 6 and 9 are dependent on claim 1 and are patentable for the same reasons 

In view of the above changes and remarks, allowance of the claims is 
respectfully requested. 

Should the examiner have any questions he is invited to call Applicant's 
attorney at the number given below. 